73 research outputs found

    Learning a local-variable model of aromatic and conjugated systems

    A collection of new approaches to building and training neural networks, collectively referred to as deep learning, is attracting attention in theoretical chemistry. Several groups aim to replace computationally expensive ab initio quantum mechanics calculations with learned estimators. This raises questions about the representability of complex quantum chemical systems with neural networks. Can local-variable models efficiently approximate nonlocal quantum chemical features? Here, we find that convolutional architectures, those that only aggregate information locally, cannot efficiently represent aromaticity and conjugation in large systems. They cannot represent long-range nonlocality known to be important in quantum chemistry. This study uses aromatic and conjugated systems computed from molecule graphs, though reproducing quantum simulations is the ultimate goal. This task, by definition, is both computable and known to be important to chemistry. The failure of convolutional architectures on this focused task calls into question their use in modeling quantum mechanics. To remedy this heretofore unrecognized deficiency, we introduce a new architecture that propagates information back and forth in waves of nonlinear computation. This architecture is still a local-variable model, and it is both computationally and representationally efficient, processing molecules in sublinear time with far fewer parameters than convolutional networks. Wave-like propagation models aromatic and conjugated systems with high accuracy, and even models the impact of small structural changes on large molecules. This new architecture demonstrates that some nonlocal features of quantum chemistry can be efficiently represented in local-variable models.

    Securely measuring the overlap between private datasets with cryptosets

    Many scientific questions are best approached by pooling data, collected by different groups or across large collaborative networks, into a combined analysis. Unfortunately, some of the most interesting and powerful datasets, such as health records, genetic data, and drug discovery data, cannot be freely shared because they contain sensitive information. In many situations, knowing whether private datasets overlap determines whether it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which cannot be broken even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with a stable accuracy, and secure.

    Opportunities and obstacles for deep learning in biology and medicine

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    Modeling reactivity to biological macromolecules with a deep multitask network

    Most small-molecule drug candidates fail before entering the market, frequently because of unexpected toxicity. Often, toxicity is detected only late in drug development, because many types of toxicities, especially idiosyncratic adverse drug reactions (IADRs), are particularly hard to predict and detect. Moreover, drug-induced liver injury (DILI) is the most frequent reason drugs are withdrawn from the market and causes 50% of acute liver failure cases in the United States. A common mechanism often underlies many types of drug toxicities, including both DILI and IADRs. Drugs are bioactivated by drug-metabolizing enzymes into reactive metabolites, which then conjugate to sites in proteins or DNA to form adducts. DNA adducts are often mutagenic and may alter the reading and copying of genes and their regulatory elements, causing gene dysregulation and even triggering cancer. Similarly, protein adducts can disrupt their normal biological functions and induce harmful immune responses. Unfortunately, reactive metabolites are not reliably detected by experiments, and it is also expensive to test drug candidates for potential to form DNA or protein adducts during the early stages of drug development. In contrast, computational methods have the potential to quickly screen for covalent binding potential, thereby flagging problematic molecules and reducing the total number of necessary experiments. Here, we train a deep convolutional neural network, the XenoSite reactivity model, using literature data to accurately predict both sites and probability of reactivity for molecules with glutathione, cyanide, protein, and DNA. On the site level, cross-validated predictions had area under the curve (AUC) performances of 89.8% for DNA and 94.4% for protein. Furthermore, the model separated molecules electrophilically reactive with DNA and protein from nonreactive molecules with cross-validated AUC performances of 78.7% and 79.8%, respectively.
At both the site and molecule levels, the model significantly outperformed reactivity indices derived from quantum simulations that are reported in the literature. Moreover, we developed and applied a selectivity score to assess preferential reactions with the macromolecules as opposed to the common screening traps. For the entire data set of 2803 molecules, this approach yielded totals of 257 (9.2%) and 227 (8.1%) molecules predicted to be reactive only with DNA and protein, respectively, and hence those that would be missed by standard reactivity screening experiments. Site-of-reactivity data is an underutilized resource that can be used not only to predict whether molecules are reactive, but also to show where they might be modified to reduce toxicity while retaining efficacy. The XenoSite reactivity model is available at http://swami.wustl.edu/xenosite/p/reactivity

    Large scale study of multiple-molecule queries

    Background: In ligand-based screening, as well as in other chemoinformatics applications, one seeks to search large repositories of molecules effectively in order to retrieve molecules that are similar, typically, to a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, the previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and background. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results: Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family.
One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion: Fourteen methods for multiple-molecule querying of chemical databases, including the novel methods ETD and TPD, are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/
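
The MAX-SIM scoring rule described in this abstract can be illustrated with a minimal sketch. It assumes Tanimoto similarity over fingerprints stored as sets of "on" bit indices; the abstract does not specify the fingerprint type, and the function names and toy data below are hypothetical, not taken from the paper.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Hypothetical representation: a fingerprint is the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def max_sim(candidate: Fingerprint, family: Sequence[Fingerprint]) -> float:
    """MAX-SIM: score a candidate by its maximum similarity to any family member."""
    return max(tanimoto(candidate, member) for member in family)


def rank_by_max_sim(
    database: Dict[str, Fingerprint], family: Sequence[Fingerprint]
) -> List[Tuple[str, float]]:
    """Return database entries sorted from highest to lowest MAX-SIM score."""
    scored = [(name, max_sim(fp, family)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy query family and background database (illustrative values only).
    family = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
    database = {
        "a": frozenset({1, 2, 3}),
        "b": frozenset({3, 4, 5}),
        "c": frozenset({9}),
    }
    for name, score in rank_by_max_sim(database, family):
        print(name, score)
```

Being parameter free, this rule needs no fitting step, which is the trade-off the abstract draws against ETD, TPD, and BKD.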

    Development and validation of a deep learning model to quantify glomerulosclerosis in kidney biopsy specimens

    Importance: A chronic shortage of donor kidneys is compounded by a high discard rate, and this rate is directly associated with biopsy specimen evaluation, which shows poor reproducibility among pathologists. A deep learning algorithm for measuring percent global glomerulosclerosis (an important predictor of outcome) on images of kidney biopsy specimens could enable pathologists to more reproducibly and accurately quantify percent global glomerulosclerosis, potentially saving organs that would have been discarded. Objective: To compare the performances of pathologists with a deep learning model on quantification of percent global glomerulosclerosis in whole-slide images of donor kidney biopsy specimens, and to determine the potential benefit of a deep learning model on organ discard rates. Design, Setting, and Participants: This prognostic study used whole-slide images acquired from 98 hematoxylin-eosin-stained frozen and 51 permanent donor biopsy specimen sections retrieved from 83 kidneys. Serial annotation by 3 board-certified pathologists served as ground truth for model training and for evaluation. Images of kidney biopsy specimens were obtained from the Washington University database (retrieved between June 2015 and June 2017). Cases were selected randomly from a database of more than 1000 cases to include biopsy specimens representing an equitable distribution within 0% to 5%, 6% to 10%, 11% to 15%, 16% to 20%, and more than 20% global glomerulosclerosis. Main Outcomes and Measures: Correlation coefficient (r) and root-mean-square error (RMSE) with respect to annotations were computed for cross-validated model predictions and on-call pathologists' estimates of percent global glomerulosclerosis when using individual and pooled slide results. Data were analyzed from March 2018 to August 2020.
Results: The cross-validated model results of section images retrieved from 83 donor kidneys showed higher correlation with annotations (r = 0.916; 95% CI, 0.886-0.939) than on-call pathologists (r = 0.884; 95% CI, 0.825-0.923) that was enhanced when pooling glomeruli counts from multiple levels (r = 0.933; 95% CI, 0.898-0.956). Model prediction error for single levels (RMSE, 5.631; 95% CI, 4.735-6.517) was 14% lower than on-call pathologists (RMSE, 6.523; 95% CI, 5.191-7.783), improving to 22% with multiple levels (RMSE, 5.094; 95% CI, 3.972-6.301). The model decreased the likelihood of unnecessary organ discard by 37% compared with pathologists. Conclusions and Relevance: The findings of this prognostic study suggest that this deep learning model provided a scalable and robust method to quantify percent global glomerulosclerosis in whole-slide images of donor kidneys. The model performance improved by analyzing multiple levels of a section, surpassing the capacity of pathologists in the time-sensitive setting of examining donor biopsy specimens. The results indicate the potential of a deep learning model to prevent erroneous donor organ discard
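
The two outcome measures named in this abstract, the correlation coefficient (r) and RMSE, can be computed as in the sketch below, which also checks the quoted 14% and 22% error reductions against the reported RMSE values. The helper names are hypothetical, and r is assumed to be the Pearson coefficient, which the abstract does not state explicitly.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root-mean-square error of predictions against reference annotations."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # RMSE values as quoted in the abstract.
    pathologist, single_level, multi_level = 6.523, 5.631, 5.094
    print(round(100 * relative_reduction(pathologist, single_level)))  # 14
    print(round(100 * relative_reduction(pathologist, multi_level)))   # 22
```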

    Machine learning liver-injuring drug interactions with non-steroidal anti-inflammatory drugs (NSAIDs) from a retrospective electronic health record (EHR) cohort

    Drug-drug interactions account for up to 30% of adverse drug reactions. Increasing prevalence of electronic health records (EHRs) offers a unique opportunity to build machine learning algorithms to identify drug-drug interactions that drive adverse events. In this study, we investigated hospitalization data to study drug interactions with non-steroidal anti-inflammatory drugs (NSAIDs) that result in drug-induced liver injury (DILI). We propose a logistic regression-based machine learning algorithm that unearths several known interactions from an EHR dataset of about 400,000 hospitalizations. Our proposed modeling framework is successful in detecting 87.5% of the positive controls, which are defined by drugs known to interact with diclofenac causing an increased risk of DILI, and correctly ranks aggregate risk of DILI for eight commonly prescribed NSAIDs. We found that our modeling framework is particularly successful in inferring associations of drug-drug interactions from relatively small EHR datasets. Furthermore, we have identified a novel and potentially hepatotoxic interaction that might occur during concomitant use of meloxicam and esomeprazole, which are commonly prescribed together to allay NSAID-induced gastrointestinal (GI) bleeding. Empirically, we validate our approach against prior methods for signal detection on EHR datasets, in which our proposed approach outperforms all the compared methods across most metrics, such as area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC).

    Deep learning quantification of percent steatosis in donor liver biopsy frozen sections

    BACKGROUND: Pathologist evaluation of donor liver biopsies provides information for accepting or discarding potential donor livers. Due to the urgent nature of the decision process, this is regularly performed using frozen sectioning at the time of biopsy. The percent steatosis in a donor liver biopsy correlates with transplant outcome; however, there is significant inter- and intra-observer variability in quantifying steatosis, compounded by frozen section artifact. We hypothesized that a deep learning model could identify and quantify steatosis in donor liver biopsies. METHODS: We developed a deep learning convolutional neural network that generates a steatosis probability map from an input whole slide image (WSI) of a hematoxylin and eosin-stained frozen section, and subsequently calculates the percent steatosis. Ninety-six WSI of frozen donor liver sections from our transplant pathology service were annotated for steatosis and used to train (n = 30 WSI) and test (n = 66 WSI) the deep learning model. FINDINGS: The model had good correlation and agreement with the annotation in both the training set (r = 0.88 and intraclass correlation coefficient [ICC] = 0.88) and novel input test sets (r = 0.85 and ICC = 0.85). These measurements were superior to the estimates of the on-service pathologist at the time of initial evaluation (r = 0.52 and ICC = 0.52 for the training set, and r = 0.74 and ICC = 0.72 for the test set). INTERPRETATION: This deep learning algorithm could be incorporated into routine pathology workflows for fast, accurate, and reproducible donor liver evaluation. FUNDING: Mid-America Transplant Society.

    Discovery of novel reductive elimination pathway for 10-hydroxywarfarin

    Coumadin (R/S-warfarin) anticoagulant therapy is highly efficacious in preventing the formation of blood clots; however, significant inter-individual variation in response risks over- or under-dosing, resulting in adverse bleeding events or ineffective therapy, respectively. Levels of pharmacologically active forms of the drug and metabolites depend on a diversity of metabolic pathways. Cytochromes P450 play a major role in oxidizing R- and S-warfarin to 6-, 7-, 8-, 10-, and 4'-hydroxywarfarin, and warfarin alcohols form through a minor metabolic pathway involving reduction at the C11 position. We hypothesized that due to structural similarities with warfarin, hydroxywarfarins undergo reduction, possibly impacting their pharmacological activity and elimination. We modeled reduction reactions and carried out experimental steady-state reactions with human liver cytosol for conversion o